1. Introduction

Computer vision has been regarded as one of the most complex and computationally intensive problems. The algorithms involved employ a very broad spectrum of techniques from several areas such as signal processing, advanced mathematics, graph theory, and artificial intelligence. These algorithms are, in general, characterized by massive parallelism. For low level processing, spatial decomposition of the image provides a natural way of generating parallel tasks. For higher level processing operations, parallelization may also be based on other image characteristics. The multi-dimensional divide-and-conquer paradigm [3] is an attractive mechanism for providing parallelism in both of the above cases.

There is no general definition of an Integrated Vision System (IVS). However, an application dependent definition of an IVS is possible. For example, an object recognition system that takes an image (or a set of images) of an object as input and produces as output a description of the object can be considered an IVS. However, a system (or an algorithm) that takes an image as input and produces its Discrete Fourier Transform (DFT) is not considered an IVS, though computing the DFT itself may be a part of an IVS. Therefore, an IVS can be viewed as a system which employs a number of vision algorithms in a systematic way to produce a specified output. In this paper we are interested in an IVS from the viewpoint of complexity and execution of the necessary computations. An architecture for vision must be powerful as well as general enough to efficiently execute algorithms for any given computer vision problem. Researchers in the vision and architecture communities are recognizing the need for architectures that are suited for IVSs rather than architectures that are good for a few applications but are too rigid to perform any other applications efficiently. In [4] Weems et al. present an integrated image understanding benchmark that consists of algorithms that may comprise an object recognition system.

The advent of VLSI technology has enabled architects to produce high performance chips for specific applications. But these special purpose chips can only be used in an IVS as accelerators of specific algorithms (e.g., convolution chips or FFT chips). Another use of VLSI technology has been to create massively parallel Single Instruction Multiple Data (SIMD) processors for vision and other applications. Massively parallel SIMD processors are well suited for low level and well structured vision algorithms that exhibit spatial parallelism at the pixel level. However, such architectures are not well suited for high level vision algorithms because high level vision algorithms require non-uniform processing, more complex data structures and data dependent decision making capabilities. Furthermore, mapping a parallel algorithm onto such architectures becomes inefficient when the problem size does not match the available processor size and when its communication requirements do not match the underlying interconnection structure of the parallel processor machine.

Meshes, array processors, hypercubes and pyramids are some of the most common SIMD parallel processors proposed for image analysis and processing. In meshes, the processing elements are arranged in a square array. Examples of mesh connected computers include CLIP4 [5, 6, 7] and the MPP [8, 9, 10]. In [11], Ahuja and Swamy proposed a multiprocessor pyramid architecture based on the divide-and-conquer model of computation. Such pyramids are, therefore, natural candidates for executing divide-and-conquer algorithms, since they closely mirror the flow of information in these algorithms. Examples of other pyramid architectures include PAPIA [12], SPHINX [13], the MPP pyramid [14], and the HCL Pyramid [15, 16]. However, the design of an integrated vision system requires greater flexibility, partitionability, and reconfigurability than is offered by regular array, mesh connected or pyramid structures as in [1]. For this reason other multiprocessor architectures and parallel algorithms have been proposed [17, 18, 19, 20], some of which discuss the flexibility, partitionability, and reconfigurability issues. The CMU Warp processor [21, 22, 23, 24] is another machine proposed and built for image understanding. The machine has a programmable systolic array of linearly connected cells. The array can perform low level and high level vision algorithms, in systolic mode for low level operations and in MIMD mode for high level operations. There are unique features in the Warp processor that are not present in other architectures. The machine can be reconfigured in several modes [25]. The UMass image understanding architecture is based on a Content Addressable Array Parallel Processor for low level vision and a MIMD parallel processor for high level vision [26].

The effectiveness and performance of architectures such as pyramids, array processors, and meshes are limited as architectures for integrated vision systems for several reasons. First, they are mostly suitable for SIMD type algorithms, which constitute only the low level vision operations. Second, the architectures are inflexible due to their rigid interconnections. Third, the number of processors needed to solve a problem of reasonable size is in the thousands. Such a large number of processors is not only cost prohibitive, but the processors themselves cannot be very powerful and can have only limited features due to technological limitations. Fourth, it is normally assumed that the problem size exactly matches the number of processors available. Most of the time it is not clear how to adapt algorithms so that problems of different sizes can be solved using the same number of processors. Finally, the problems of input-output of data and fault-tolerance are rarely addressed in any of these architectures. It is important to note that no matter how fast or powerful a particular architecture is, its utilization can be limited by the bandwidth of the I/O. Furthermore, a failure normally results either in a breakdown of the entire system or in a tremendous degradation of performance. It is important that the architecture provide for graceful degradation. Graceful degradation can be achieved by providing flexibility in the interconnection and a capability for dynamic reconfiguration and partitioning of the architecture.

In this paper, we present a parallel architecture called NETRA for IVSs which is intended to provide the above flexibility. The architecture was originally proposed by Sharma, Patel and Ahuja [1]. NETRA is a recursively defined tree-type hierarchical architecture, each of whose leaf nodes consists of a cluster of processors connected with a programmable crossbar with selective broadcast capability. The internal nodes are scheduling processors whose function is task scheduling, load balancing, and global memory management. All the scheduling processors and the cluster processors are connected to a global memory through a multistage circuit switched network. The processors in clusters can operate in SIMD, MIMD or systolic mode, and therefore, are suitable for both low level and high level vision algorithms.

In Section 2 we propose a model of computation for integrated vision systems. The model is discussed from the parallel processing perspective. Using the model we derive desired features and capabilities of a parallel architecture for IVSs. Section 3 presents the architecture of NETRA and describes its components and their functions in detail. Then the architecture is critically examined with respect to the IVS requirements. In Section 4 we present methods to map parallel algorithms on NETRA. We also discuss the alternative communication strategies in NETRA and present a qualitative evaluation of the strategies. The algorithms are classified according to their computation and communication requirements for parallel processing. Section 5 presents an analysis of alternative inter-cluster communication strategies in NETRA and discusses a methodology to evaluate a parallel algorithm which has been mapped across clusters. Finally, a summary is presented in Section 6.

2. Model of Computation for Integrated Vision Systems 

There are two types of parallelism available in vision tasks. The first is Spatial Parallelism, in which the same operation is applied to all parts of the image data. That is, the data is divided into many granules and distributed to different subtasks, which may execute on different processors in parallel. Most low level vision algorithms exhibit this type of parallelism. However, different tasks may be performed sequentially in time on the same granules of data. Each such task operates on the output data of the previous task. Therefore, the type of data, data structures, etc., may be different for each task in the system, but each form of data can be partitioned into several granules to be processed in parallel. For example, consider an IVS that performs object recognition. The input image is smoothed using some filtering operation, then an operator is applied to the smoothed image for feature extraction, features with similar characteristics are grouped, and then matching with the models is performed. Each of these tasks takes the output of the previous task as its input and produces an output which becomes the input for the next task.

The second form of parallelism, called Temporal Parallelism, is available when these tasks are repeated on a time sequence of images or on different resolutions of images. For example, a system in which the motion of a moving object is estimated from an image sequence performs the same set of computations on all image frame(s) in the sequence. The processing of each frame or set of frames can be done in parallel.

Figure 1 shows a computational model of an IVS which incorporates the above mentioned characteristics. Each pipeline shows a number of tasks applied to a set of inputs. Each block in the pipeline represents one task. The input to the first task in a pipeline is the image, and the input to each of the remaining tasks is the output of the previous task. The entire pipeline of tasks is repeated on different images in time and/or resolution. Each task is decomposed into subtasks to be performed in parallel. For example, $T_i$ is one task, and $T_i(d_j)$ is a subtask of $T_i$ operating on data granule $d_j$. The figure shows $m$ tasks in the pipeline. The number of subtasks depends on the amount of data in a granule and the number of available processors. $D_{i,i+1}$ represents data transfer from task $T_i$ to task $T_{i+1}$ in the pipeline.
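The pipeline model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`split_rows`, `run_task`, `run_pipeline`) and the two toy tasks are hypothetical, granules are taken to be contiguous bands of image rows, and a thread pool stands in for the processors.

```python
from concurrent.futures import ThreadPoolExecutor

def split_rows(image, n_granules):
    """Split a list of rows into n_granules contiguous granules of near-equal size."""
    base, extra = divmod(len(image), n_granules)
    granules, start = [], 0
    for g in range(n_granules):
        size = base + (1 if g < extra else 0)
        granules.append(image[start:start + size])
        start += size
    return granules

def run_task(subtask, granules, workers):
    """Run one task T_i: apply the same subtask to every granule d_j in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(subtask, granules))

def run_pipeline(image, tasks, n_granules=4, workers=4):
    """Run tasks T_1..T_m in sequence; the output of each task feeds the next."""
    granules = split_rows(image, n_granules)
    for subtask in tasks:
        granules = run_task(subtask, granules, workers)
    return [row for granule in granules for row in granule]

# Two hypothetical pixel-wise tasks standing in for real vision operations.
def threshold(granule):
    return [[1 if p > 2 else 0 for p in row] for row in granule]

def invert(granule):
    return [[1 - p for p in row] for row in granule]

image = [[r + c for c in range(4)] for r in range(6)]
result = run_pipeline(image, [threshold, invert])
```

Because both toy tasks are pixel-wise, no data has to be exchanged between subtasks; tasks with neighbourhood or global dependencies would need the communication steps discussed in the next subsection.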

2.1. Data Dependencies 

The existence of spatial and temporal parallelism may also result in two types of data dependencies, namely, spatial data dependency and temporal data dependency. Spatial data dependency can be classified into intratask data dependency and intertask data dependency. Intratask data dependency arises when a set of subtasks need to exchange data in order to execute a task in parallel. The exchange of data may be needed during the execution of the algorithm, or to combine their partial results, or both. Therefore, each task itself is a collection of subtasks which may be represented as a graph with nodes representing the subtasks and edges representing communication between subtasks. Intertask data dependency denotes the transfer and reorganization of data to be passed on to the next task in the pipeline. This may be done by exchanging data between the subtasks of the current task and the subtasks of the next task, or by collection and reorganization of the output data of the current task followed by redistribution of the data. The choice of method depends on the underlying parallel architecture and the mapping of the algorithms. Temporal data dependency is similar to spatial data dependency except that some form of output generated by tasks executed on the previous image frames may be needed by one or more tasks executing on the current image frames. A simple example of such a dependency is a system for motion estimation in which features from the previous image frames are needed in the processing of the current image frames so that features can be matched to establish correspondences between features of different time frames.

Figure 1 : A Computational Model of an Integrated Vision System

The total computation time to execute one pipeline includes the time to input data, the time to output data and results, the sum of the times to execute all tasks in the pipeline (which includes the computation time of subtasks and the communication time between subtasks), and the data transfer and reorganization time between successive tasks. Let us denote $t_{cp}$ as the computation time for a subtask, $t_{comm}$ as the total communication time for a task, $t_{in}$ as the data input time, $t_{out}$ as the data output time, and $t_{d}$ as the data transfer and reorganization time. Then the time to complete task $i$, denoted as $X_i$, is given by

$$X_i = \max_{1 \le j \le n_i} t_{cp}(T_i(d_j)) + t_{comm}(T_i) \qquad (1)$$

The total time to execute one pipeline, including the input and output of the data, is given by

$$t_{tot} = t_{in} + \sum_{i=1}^{m} X_i + \sum_{i=1}^{m-1} t_{d}(D_{i,i+1}) + t_{out} \qquad (2)$$
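As a worked illustration of Equations (1) and (2), the following Python sketch evaluates the timing model for a three-task pipeline. The function names and all numbers are made up for the example, and it assumes one reading of the model: a task finishes when its slowest subtask finishes and its communication completes, and one transfer term is charged between each pair of successive tasks.

```python
def task_time(t_cp, t_comm):
    """Equation (1): slowest subtask plus the task's total communication time."""
    return max(t_cp) + t_comm

def pipeline_time(t_in, t_out, tasks, t_transfer):
    """Equation (2): input + all task times + inter-task transfers + output.

    tasks      -- one (subtask computation times, communication time) pair per task
    t_transfer -- transfer/reorganization time between each pair of successive tasks
    """
    if len(t_transfer) != len(tasks) - 1:
        raise ValueError("need exactly one transfer time between successive tasks")
    task_total = sum(task_time(cp, comm) for cp, comm in tasks)
    return t_in + task_total + sum(t_transfer) + t_out

# Hypothetical timings (arbitrary units) for a pipeline of m = 3 tasks.
tasks = [
    ([4.0, 5.0, 4.5], 1.0),        # task 1: three subtasks
    ([2.0, 2.0], 0.5),             # task 2: two subtasks
    ([7.0, 3.0, 6.0, 8.0], 2.0),   # task 3: four subtasks
]
total = pipeline_time(t_in=3.0, t_out=1.0, tasks=tasks, t_transfer=[0.5, 1.5])
```

With these numbers the three task times are 6.0, 2.5 and 10.0, so the pipeline total is 3.0 + 18.5 + 2.0 + 1.0 = 24.5. The unbalanced third task dominates, which is the situation that load balancing is meant to avoid.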

Let us now consider some characteristics of the algorithms involved in an IVS and, using the above model, determine the desired features and capabilities of a multiprocessor architecture suitable for IVS. First, an IVS involves algorithms from all levels of processing, i.e., an IVS normally includes low, intermediate and high level vision algorithms. Typically, the first few tasks of the pipeline require low level algorithms and the last few require high level algorithms. The low level algorithms are well understood and well defined. They are normally data independent, have regular structure, and spatial parallelism is mostly available at the pixel level. They are well suited for both SIMD and MIMD types of processing. If communication between processors is fast enough, almost linear speedups are possible by using multiple processors. Therefore, an architecture for IVS should be capable of efficiently executing low level algorithms and algorithms suited for SIMD type processing. Also, data I/O should not be a bottleneck because otherwise speedups through parallelism can be nullified. Examples of low level algorithms include various transforms, filtering algorithms, convolution algorithms, etc.

High level algorithms are not well catalogued. They are normally global data dependent, involve more complex structures (compared to pixel representation), and need varying amounts of communication for parallel processing. These types of algorithms are better suited for MIMD type processing. Hence, the architecture should be capable of executing MIMD algorithms efficiently.

2.2. Desired Features and Capabilities of Parallel Architectures for IVS

(1) Reconfigurability : From the model and the above discussion it is clear that the architecture should be capable 
of executing both SIMD and MIMD type of algorithms efficiently. That is, it should be possible to 
reconfigure the architecture such that each algorithm can be implemented efficiently using the most suited 
mode of computation. 

(2) Flexible Communication : The communication requirements vary for different algorithms. The communica- 
tion pattern between processors executing subtasks of a larger task depends on the algorithm involved in the 
task. If the connectivity between processors is too rigid then the communication overhead of intratask and 
intertask communication may become prohibitive. Therefore, it is desirable that the communication be
flexible in order to provide the most efficient communication with low overhead.

(3) Resource Allocation and Partitionability : As we discussed earlier, there are several tasks with vastly different 
characteristics and computational requirements in an IVS. These tasks need to exist simultaneously in the sys- 
tem. Therefore, the system should be partitionable into many independently controlled subsystems to execute 
each task. Since the high level algorithms exhibit varying levels of parallelism and data dependent
performance, it should be possible to allocate resources (such as processors, memory) dynamically to meet the
performance requirements.

(4) Load Balancing and Task Scheduling : Load balancing and task scheduling are very important, especially for high level vision algorithms. High level vision algorithms are data dependent, and therefore, in order to obtain better utilization of resources and better speedups, dividing the computation equally among the processors is critical [27]. The underlying architecture on which load balancing is done and the type of algorithm(s) involved contribute significantly to how well load balancing can be achieved. In low level algorithms, since the computations are data independent, partitioning data equally among the processors normally balances the load among the processors. However, for high level algorithms more sophisticated load balancing and scheduling strategies are needed. The architecture should include features such that it is easy to perform load balancing and task scheduling and that the overhead of doing so is minimal.

(5) Topology and Data Size Independent Mapping : For a system as complex as an IVS, if the underlying architecture is rigid such that the size of the problem that can be solved on it is dependent on the size of the architecture, the effectiveness of the architecture for an IVS will diminish.

(6) Input-Output of Data : It is most often the case that an architecture is able to perform very well on some algorithms and high speedups are obtained, but input-output (I/O) of data is inefficient. I/O is an integral part of a system, and if it is a bottleneck then the performance of the system will be limited.

(7) Fault-Tolerance : Fault-tolerance is an important part of a system of such complexity. A failure in a processor or communication structure should not affect the performance drastically, which is normally the case when rigid interconnections are present between processors. The architecture should provide for graceful degradation in case of failures.
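The contrast drawn in item (4) between data independent and data dependent work can be illustrated with a short Python sketch. It is not a scheduling method from this paper: `equal_partition` shows the equal split that suffices when every data item costs the same, and `greedy_balance` uses the well-known longest-processing-time-first heuristic as one example of a more sophisticated strategy. Both function names and the cost figures are hypothetical.

```python
import heapq

def equal_partition(n_items, n_procs):
    """Items per processor when the work per item is uniform (low level case).

    Works for any n_items and n_procs, so the data size need not match the machine size.
    """
    base, extra = divmod(n_items, n_procs)
    return [base + (1 if p < extra else 0) for p in range(n_procs)]

def greedy_balance(costs, n_procs):
    """Assign tasks of unequal cost, largest first, to the least-loaded processor.

    Returns the resulting load of each processor, indexed by processor number.
    """
    heap = [(0.0, p) for p in range(n_procs)]   # (current load, processor index)
    loads = [0.0] * n_procs
    for cost in sorted(costs, reverse=True):
        load, p = heapq.heappop(heap)
        loads[p] = load + cost
        heapq.heappush(heap, (loads[p], p))
    return loads

rows_per_pe = equal_partition(512, 48)            # 512 image rows on 48 processors
loads = greedy_balance([7, 5, 4, 3, 3, 2], 3)     # six data dependent tasks on 3 processors
```

In the second call the total work is 24 units, so a perfect split would give each processor 8; the greedy heuristic ends with loads of 9, 8 and 7.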

3. Architecture of NETRA 

In this section we describe the architecture of NETRA and its features. We examine and evaluate the charac- 
teristics of the architecture using the criteria developed in the previous section. 

Figure 2 shows the architecture of NETRA for integrated vision systems. The architecture consists of the
following components:

(1) A large number (1000 - 10000) of Processing Elements (PEs), organized into clusters of 16 to 64 PEs each.

(2) A tree of Distributing-and-Scheduling-Processors (DSPs) that make up the task distribution and control struc- 
ture of the multiprocessor. 

(3) A parallel pipelined shared Global Memory and a Global Interconnection that links the PEs and DSPs to the
Global Memory.

3.1. Processor Clusters 

The clusters consist of 16 to 64 PEs, each with its own program and data memory. They form a layer below the DSP-tree, with a leaf DSP associated with each cluster. PEs within a cluster also share a common data memory. The PEs, the DSP associated with the cluster, and the shared memory are connected together with a crossbar switch. The crossbar switch permits point-to-point communications as well as selective broadcast by the DSP or any of the PEs. Figure 3 shows the cluster organization. A 4x4 crossbar is shown as an example of the implementation of the crossbar switch. The crossbar design consists of pass transistors connecting the input and output data lines. The switches are controlled by control bits indicating the connection pattern. If a processor or DSP needs to broadcast, then all the control bits in its row are set to one. In order to connect processor $P_i$ to processor $P_j$, control bit $(i, j)$ is set to one and the rest of the control bits in row $i$ and column $j$ are off.

Figure 2 : Organization of NETRA
DSP : Distributing and Scheduling Processor
C : Processor Cluster
M : Memory Module
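The control-bit rules for the cluster crossbar can be modelled in software as follows. This is a minimal sketch, not the hardware design: the class and method names are hypothetical, ports are simply numbered 0 to n-1 without distinguishing the DSP from the PEs, and data movement is reduced to copying one value per source.

```python
class Crossbar:
    """n x n crossbar without arbitration.

    Control bit (i, j) set to 1 means source i drives destination j.
    """

    def __init__(self, n):
        self.n = n
        self.bits = [[0] * n for _ in range(n)]

    def connect(self, src, dst):
        """Point-to-point: set bit (src, dst), clear the rest of row src and column dst."""
        for k in range(self.n):
            self.bits[src][k] = 0
            self.bits[k][dst] = 0
        self.bits[src][dst] = 1

    def broadcast(self, src):
        """Broadcast: all bits in row src are one; every other row is cleared."""
        self.bits = [[1 if i == src else 0 for _ in range(self.n)]
                     for i in range(self.n)]

    def transfer(self, outputs):
        """Deliver outputs[i] of each source to the destinations its row selects."""
        inputs = [None] * self.n
        for i in range(self.n):
            for j in range(self.n):
                if self.bits[i][j]:
                    inputs[j] = outputs[i]
        return inputs

xb = Crossbar(4)
xb.connect(0, 2)
xb.connect(1, 3)
after_connect = xb.transfer(["a", "b", "c", "d"])
xb.broadcast(1)
after_broadcast = xb.transfer(["a", "b", "c", "d"])
```

A selective broadcast would set only a chosen subset of the bits in the source's row; that variant is omitted here for brevity.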

Clusters can operate in an SIMD mode, a systolic mode, or an MIMD mode. Each PE is a general purpose processor with a high speed floating point capability. In an SIMD mode, PEs in a cluster execute identical instruction streams from private memories in a lock-step fashion. In the systolic mode, PEs repetitively execute an instruction or set of instructions on data streams from one or more PEs. In both cases, communication between PEs is synchronous. In the MIMD mode, PEs asynchronously execute instruction streams resident in their private memories. The streams need not be identical. In order to synchronize the processors in a cluster, a synchronization bus is provided, which is used by processors to indicate to the DSP that one or more processors have finished their computation or that a processor wants to change the communication pattern. The DSP can either poll the processors or the processors can interrupt the DSP using the synchronization bus.

Figure 3 : Organization of Processor Cluster
PE : Processor
M : Local Memory
CDM : Common Data Memory

3.1.1. Crossbar Design

There is no arbitration in the crossbar switch. That is, the interconnection between processors has to be programmed before processors can communicate with each other. Programming a crossbar requires writing a communication pattern into the control memory of the crossbar. A processor can alter the communication pattern by updating the control memory as long as the update does not conflict with the existing communication pattern. The DSP associated with the cluster can write into the control memory to alter the communication pattern. The most common communication patterns, such as linear arrays, trees, meshes, pyramids, shuffle-exchanges, cubes, and broadcast, can be stored in the memory of the crossbar. These patterns need not be supplied externally. Therefore, switching to a different pattern in the crossbar can be fast because switching only requires writing the pattern into the control bits of the crossbar switches from its control memory.
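The idea of keeping common patterns in the crossbar's control memory can be sketched as below. The names (`ring`, `linear_array`, `broadcast_from`, `ControlMemory`) are hypothetical, only three of the patterns listed above are generated, and "conflict" is interpreted here as two sources driving the same destination, which is an assumption about what the hardware would disallow.

```python
def ring(n):
    """Pattern connecting P_i to P_((i+1) mod n)."""
    return [[1 if j == (i + 1) % n else 0 for j in range(n)] for i in range(n)]

def linear_array(n):
    """Pattern connecting P_i to P_(i+1), with no wrap-around link."""
    return [[1 if j == i + 1 else 0 for j in range(n)] for i in range(n)]

def broadcast_from(src, n):
    """Pattern in which source src drives every destination."""
    return [[1 if i == src else 0 for _ in range(n)] for i in range(n)]

def is_conflict_free(pattern):
    """Without arbitration, each destination (column) may have at most one driver."""
    return all(sum(column) <= 1 for column in zip(*pattern))

class ControlMemory:
    """Holds named patterns; switching copies one into the active control bits."""

    def __init__(self, n):
        self.n = n
        self.patterns = {}
        self.active = [[0] * n for _ in range(n)]

    def store(self, name, pattern):
        if not is_conflict_free(pattern):
            raise ValueError("pattern drives a destination from two sources")
        self.patterns[name] = [row[:] for row in pattern]

    def switch_to(self, name):
        self.active = [row[:] for row in self.patterns[name]]

memory = ControlMemory(4)
memory.store("ring", ring(4))
memory.store("line", linear_array(4))
memory.store("bcast0", broadcast_from(0, 4))
memory.switch_to("ring")
```

Switching patterns is then a single copy from stored bits to active bits, with no pattern data supplied from outside the crossbar.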

The advantages of such a crossbar design are the following. First, since there is no arbitration, the crossbar is faster than one which provides arbitration because switching and arbitration delays are avoided. Second, it is easier to design and implement the crossbar because arbitration is absent, and therefore, the switches are simple. Furthermore, it is possible to implement systolic algorithms using the crossbar because it can transfer data at the same or greater speed than required by the systolic computation. Such a crossbar is easily scalable. Unlike other interconnections (such as cubes, shuffle-exchanges, etc.), the size need not scale in powers of 2. Scaling by a single unit is possible. For the same reason, it is easy to provide fault-tolerance because one spare processor can replace any failed processor, and one extra crossbar link can replace any failed link. This is possible because there is no inherent structure that connects the processors, and each processor (link) is topologically equivalent to any other processor (link).

3.1.2. Scalability of Crossbar 

Figure 4(a) depicts a 1 bit 4x4 crossbar switch. In order to obtain a byte or word parallel crossbar, the crossbar switches can be stacked together as shown in Figure 4(b). The control, address and communication pattern information is exactly the same in all the stacked switches. Figures 4(c), (d) and (e) illustrate the size scalability. Figure 4(c) shows how a 4x8 crossbar can be obtained from two 4x4 crossbars. Similarly, Figures 4(d) and (e) illustrate how 8x4 and 8x8 crossbars can be obtained, respectively. Note that the smallest switch need not be a 1 bit crossbar. Depending on the technology and the availability of I/O pins, it can be of any width (such as 4 bits or a byte). Furthermore, depending on the available pins, it can be a 16x16 or 32x32 bit crossbar. Finally, the size of the crossbar need not be a multiple of two but can be arbitrary.

3.2. The DSP Hierarchy 

The DSP-tree is an n-tree with nodes corresponding to DSPs and edges to bi-directional communication links. Each DSP node is composed of a processor, a buffer memory, and a corresponding controller.

The tree structure has two primary functions. First, it represents the control hierarchy for the multiprocessor. A DSP serves as a controller for the subtree structure under it. Each task starts at a node on an appropriate level in the tree, and is recursively distributed at each level of the sub-tree under the node. At the bottom of the tree, the subtasks are executed on a processor cluster in the desired mode (SIMD or MIMD) and under the supervision of the leaf DSP.





Figure 4 : Scalability of Crossbar

The second function is that of distributing the programs to leaf DSPs and the PEs. Vision algorithms are characterized by a large number of identical parallel processes that exploit the spatial parallelism and operate on different data sets. It would be highly wasteful if each PE issued a separate request to the global memory for its copy of the program block because it would result in a large amount of unnecessary traffic through the interconnection network. Under the DSP-hierarchy approach, one copy of the program is fetched by the controlling DSP (the DSP at the root of the task subtree) and then broadcast down the subtree to the selected PEs. Also, the DSP hierarchy provides communication paths between clusters to transfer control information or data from one cluster to others. Finally, the DSP-tree is responsible for Global Memory management.
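The saving from fetching a program block once and broadcasting it down the DSP-tree can be illustrated with a small counting model. Everything here is hypothetical scaffolding for the example: the `DSPNode` class, the choice of a binary subtree with four clusters of 16 PEs, and the simplification that every PE under the controlling DSP is selected.

```python
class DSPNode:
    """A DSP in the tree; a leaf DSP additionally supervises a cluster of n_pes PEs."""

    def __init__(self, children=(), n_pes=0):
        self.children = list(children)
        self.n_pes = n_pes

def broadcast_program(node):
    """Return (PEs reached, tree links used) when a block is broadcast down from node."""
    reached, links = node.n_pes, 0
    for child in node.children:
        child_reached, child_links = broadcast_program(child)
        reached += child_reached
        links += child_links + 1      # one link from node to this child
    return reached, links

def global_fetches(root, via_dsp_tree):
    """Global memory requests needed to give every PE under root the program block."""
    reached, _ = broadcast_program(root)
    return 1 if via_dsp_tree else reached

def leaf():
    return DSPNode(n_pes=16)

# Task subtree: a controlling DSP, two intermediate DSPs, four leaf DSPs with 16 PEs each.
root = DSPNode([DSPNode([leaf(), leaf()]), DSPNode([leaf(), leaf()])])
reached, links = broadcast_program(root)
```

For this subtree, individual requests would send 64 fetches through the global interconnection, whereas the broadcast approach needs a single fetch followed by traffic on 6 tree links.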

3.3. Global Memory

The multiport global memory is a parallel-pipelined structure as introduced in [28]. Given a memory (chip) access time of T processor cycles, each line has T memory modules. It accepts a request in each cycle and responds after a delay of T cycles. Since an L-port memory has L lines, the memory can support a bandwidth of L words per cycle.
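The bandwidth claim can be checked with a small cycle-level model. The sketch assumes, beyond what the text states, that consecutive requests on a line are interleaved across its T modules in round-robin order; the function names are hypothetical.

```python
def simulate_line(n_requests, T):
    """Issue one request per cycle to a line of T modules, each busy T cycles per access.

    The request issued at cycle c is served by module c mod T (assumed round-robin
    interleaving) and completes at cycle c + T. Returns the completion cycles.
    """
    busy_until = [0] * T
    done = []
    for c in range(n_requests):
        module = c % T
        if busy_until[module] > c:
            raise RuntimeError("module still busy; the line would have to stall")
        busy_until[module] = c + T
        done.append(c + T)
    return done

def words_per_cycle(L, T, n_requests=1000):
    """Steady-state throughput of an L-line memory when every line is kept busy."""
    done = simulate_line(n_requests, T)
    cycles = done[-1] - done[0] + 1   # cycles over which one line returns its words
    return L * n_requests / cycles

completions = simulate_line(3, 4)
bandwidth = words_per_cycle(L=8, T=5)
```

Because each module is revisited only every T cycles, no request ever finds its module busy, so each line sustains one word per cycle and L lines sustain L words per cycle after the initial latency of T cycles.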

Data and programs are organized in memory in blocks. Blocks correspond to "units" of data and programs. The size of a block is variable and is determined by the underlying tasks and their data structures and data requirements. A large number of blocks may together constitute an entire program or an entire image. Memory requests are for blocks. The PEs and DSPs are connected to the Global Memory with a multistage interconnection network.

The global memory is capable of queuing requests made for blocks that have not yet been written into. Each line (or port) has a Memory-Line Controller (MLC) which maintains a list of read requests to the line and services them when the block arrives. It maintains a table of tokens corresponding to blocks on the line, together with their length, virtual address and full/empty status. The MLC is also responsible for virtual memory management functions.
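The queuing behaviour of a memory-line controller resembles full/empty synchronization and can be sketched as follows. This is an illustrative model only: the class, its method names, and the use of plain strings as block tokens are assumptions, and block length and virtual address bookkeeping are left out.

```python
class MemoryLineController:
    """Queues reads for blocks that are still empty and serves them once written."""

    def __init__(self):
        self.blocks = {}      # token -> data, for blocks whose status is "full"
        self.pending = {}     # token -> requesters waiting for the block
        self.delivered = []   # (requester, token, data) in order of delivery

    def status(self, token):
        return "full" if token in self.blocks else "empty"

    def read(self, requester, token):
        """Serve the read at once if the block is full, otherwise queue it."""
        if token in self.blocks:
            self.delivered.append((requester, token, self.blocks[token]))
        else:
            self.pending.setdefault(token, []).append(requester)

    def write(self, token, data):
        """Store the block, mark it full, and service every queued read for it."""
        self.blocks[token] = data
        for requester in self.pending.pop(token, []):
            self.delivered.append((requester, token, data))

mlc = MemoryLineController()
mlc.read("PE3", "edges")            # block not yet produced: request is queued
mlc.read("PE7", "edges")
mlc.write("edges", [1, 0, 1])       # producer writes the block: both reads are served
mlc.read("DSP1", "edges")           # block already full: served immediately
```

A consumer task can therefore request the output block of a producer task before the producer has finished, which is one way the global memory can support communication between tasks.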

The two main functions of the global memory are input-output of data and programs to and from the DSPs and processor clusters, and intercluster communication between various tasks as well as within a task if a task is mapped onto more than one cluster.

3.4. Global Interconnection

The PEs and the DSPs are connected to the Global Memory using a multistage circuit-switching interconnection network. Data is transferred through the network in pages. A page is transferred from the global memory to the processor whose destination port address is given in the header; the header also contains the starting address of the page in the global memory. When data is written into the global memory, only the starting address needs to be stated. In each case, end-of-page may be indicated using an extra flag bit appended to each word.

We are evaluating an alternative strategy to connect DSPs, clusters and the global memory using a high speed bus. In this organization one port of each cluster will be connected to the high speed bus. Also, each DSP will be connected to the bus. Processors that need to communicate with processors in other clusters use explicit messages to send and receive data from the other processors. Figure 5 illustrates this method. A processor $P_i$ in cluster $C_i$ can send data to a processor $P_j$ in cluster $C_j$ as shown in the figure. $P_i$ sends the data to $DSP_i$, which sends the data to $DSP_j$ in a burst mode. $DSP_j$ then sends the data to the processor $P_j$. We are evaluating both alternatives for intercluster communication.


Figure 5 : An Alternative Strategy for Inter-Cluster Communication

3.5. IVS Computation Requirements and NETRA

In the following discussion we examine NETRA’s architecture in the light of requirements for an IVS dis- 
cussed in the previous section. 

Reconfigurability (Computation Modes) 

The clusters in NETRA provide SIMD, MIMD and systolic capabilities. As we discussed earlier, it is desirable to have these modes of operation in a multiprocessor system for IVS so that all levels of algorithms can be executed efficiently. For example, consider the matrix multiplication operation. We will show how it can be performed in SIMD and systolic modes. Let us assume that the computation requires obtaining the matrix $C = A \times B$. For simplicity, let us assume that the cluster size is $P$ and the matrix dimensions are $P \times P$. Note that this assumption is made to simplify the description of the example. In general, a computation of arbitrary size can be performed independent of the data or cluster size.

SIMD Mode 

The algorithm can be mapped as follows. Each processor is assigned a column of the B matrix, i.e., processor $P_i$ is assigned column $B_i$. Then the DSP broadcasts each row of the A matrix to the cluster processors, which compute the inner product of the row with their corresponding column in lock-step fashion. Note that the elements of the A matrix can be continuously broadcast by the DSP, row by row, without any interruptions, and therefore, efficient pipelining of the data input, multiply, and accumulate operations can be achieved. Figure 6(a) illustrates an SIMD configuration of a cluster. The following pseudocode describes the programs of the DSP and of processor $P_k$ ($0 \le k \le P-1$).

SIMD Computation

     DSP                               P_k

 1.  FOR i=0 to P-1 DO             1.  -
 2.    connect(DSP, P_i)           2.  -
 3.    out(column B_i)             3.  in(column B_k)
 4.  END_FOR                       4.  -
 5.  connect(DSP, all)             5.  -
 6.  FOR i=0 to P-1 DO             6.  c_ik = 0
 7.    FOR j=0 to P-1 DO           7.  FOR j=0 to P-1 DO
 8.      out(a_ij)                 8.    in(a_ij)
 9.    END_FOR                     9.    c_ik = c_ik + a_ij * b_jk
10.  END_FOR                      10.  END_FOR

In the above code, the computation proceeds as follows. In the first three lines, the DSP connects with each
processor through the crossbar and writes the column on the output port. That column is input by the corresponding
processor. In statement 5, the DSP connects with all the processors in broadcast mode. Then, from statement 6
onwards, the DSP broadcasts the data from matrix A in row-major order and each processor computes the inner
product of its column with each row. Finally, each processor has a column of the output matrix. It should be
mentioned that the above code describes the operation in principle and does not exactly depict the timing of
operations.
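The SIMD schedule can also be checked with a small sequential simulation. The following Python sketch is only an illustration of the data movement described above, not NETRA code: the function name `simd_matmul` and the list-of-lists matrix layout are assumptions made here, and the lock-step execution of the P processors is modeled by an ordinary inner loop.

```python
def simd_matmul(A, B):
    """Sequentially simulate the SIMD schedule for C = A x B on P processors."""
    P = len(A)
    # Statements 1-4: the DSP sends column k of B to processor P_k.
    local_col = [[B[j][k] for j in range(P)] for k in range(P)]
    # Processor P_k accumulates column k of C in its local memory.
    local_c = [[0] * P for _ in range(P)]
    # Statements 5-10: the DSP broadcasts A in row-major order.
    for i in range(P):
        for j in range(P):
            a_ij = A[i][j]              # one broadcast value, seen by every processor
            for k in range(P):          # lock-step: all processors do the same step
                local_c[k][i] += a_ij * local_col[k][j]
    # Processor P_k now holds column k of C; gather the columns into C.
    return [[local_c[k][i] for k in range(P)] for i in range(P)]


if __name__ == "__main__":
    print(simd_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

For the 2 × 2 inputs in the usage line, the product is [[19, 22], [43, 50]].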

Systolic Mode 

The same computation can be performed in a systolic mode. The DSP can reconfigure the cluster into a circular
linear array after distributing the columns of matrix B to the processors as before. Then the DSP assigns row $A_i$ of
matrix A to processor $P_i$. Each processor computes the inner product of its row with its column and at the same
time writes the element of the row on the output port. This element of the row is input to the next processor.
Therefore, each processor receives the rows of matrix A in a systolic fashion and the computation is performed in a
systolic fashion. Note that the computation and communication can be efficiently pipelined. In the code, this is
depicted by statements 7-10. Each element of the row is used by a processor and immediately written to the output
port, and at the same time the processor receives an element of the row of the previous processor. Therefore, every
P cycles a processor computes a new element of the C matrix from the new row it receives every P cycles. Again,
note that the code describes only the logic of the computation and does not include the timing information.
Figure 6(b) illustrates a systolic configuration of a cluster.


Systolic Computation

     DSP                                   P_i

 1.  FOR i=0 to P-1 DO                 1.  -
 2.    connect(DSP, P_i)               2.  -
 3.    out(column B_i)                 3.  in(column B_i)
 4.    out(row A_i)                    4.  in(row A_i)
 5.  END_FOR                           5.  -
 6.  connect(P_i to P_(i+1) mod P)     6.  c_ii = 0
 7.                                    7.  FOR j=0 to P-1 DO
 8.                                    8.    c_ii = c_ii + a_ij * b_ji
 9.                                    9.    out(a_ij), in(a_(i-1)j)
10.                                   10.  END_FOR
11.                                   11.  repeat 7-10 for each new row

In a companion paper we present several examples of mapping different algorithms in different modes on the
clusters as well as their performance evaluation.
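The systolic schedule for the matrix product can be simulated in the same spirit. This Python sketch is an illustration under simplifying assumptions: the name `systolic_matmul` is hypothetical, and the element-by-element pipelining of statements 7-10 is abstracted to whole rows, so that in each round every processor combines the row it currently holds with its own column and then hands the row to its ring neighbour.

```python
def systolic_matmul(A, B):
    """Simulate the ring (systolic) schedule for C = A x B at row granularity."""
    P = len(A)
    # Processor P_k keeps column k of B for the whole computation.
    col = [[B[j][k] for j in range(P)] for k in range(P)]
    # held[k] is the index of the row of A currently stored at processor P_k.
    held = list(range(P))               # initially P_k holds row A_k
    C = [[0] * P for _ in range(P)]
    for _ in range(P):                  # after P rounds every row has visited every processor
        for k in range(P):
            r = held[k]
            # Inner product of the held row with the local column gives c_rk.
            C[r][k] = sum(A[r][j] * col[k][j] for j in range(P))
        # Circular shift: P_k receives the row previously held by P_(k-1).
        held = [held[(k - 1) % P] for k in range(P)]
    return C


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because each row visits each processor exactly once, the result equals the ordinary matrix product; only the order in which the elements of C are produced differs from the SIMD schedule.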

Partitioning and Resource Allocation 

There are several tasks with vastly different characteristics in an IVS, and therefore the number of processors
needed for each task may be different and the processors may be needed in different computational modes. Hence,
partitionability and dynamic resource allocation are keys to high performance. Partitioning in NETRA can be
achieved as follows. When a task is to be allocated, the set of subtrees of DSPs is identified such that the required
number of PEs is available at their leaves. One of the subtrees is chosen on the basis of the characteristics of the
task. The chosen DSP represents the root of the control hierarchy for the task. Together with the DSPs in its subtree,
it manages the execution of the task. Note that partitioning is only virtual. The PEs are not required to be physically
isolated from the rest of the system. Once the subtree is chosen, the processes may execute in SIMD, MIMD or
systolic mode. The following are some of the advantages of such a scheme. Firstly, only one copy of the programs
needs to be fetched, thereby reducing the traffic through the global interconnection network. Secondly, simple load
balancing techniques may be employed while allocating tasks (examples are discussed in a companion paper). The
tasks of global memory management can be distributed over the DSP tree by assigning them to the DSP at the root
of the subtree executing the subtask. Finally, locality is maintained within the control hierarchy, which limits the
intratask communication to within the subtree.

a) SIMD Mode

b) Systolic Mode

Figure 6 : An Example of SIMD and Systolic Modes of Computation in a Cluster

Load Balancing and Task Scheduling 

Two levels of load balancing need to be employed, namely, global load balancing and local load balancing.
Global load balancing aids in partitioning and allocating the resources for tasks as discussed earlier. Local load
balancing is used to distribute computations (or data) to processors executing subtasks of a larger task. Local load
balancing can be static, dynamic, or a combination of both. With static load balancing, given a task, its associated
data and the number of processors allocated for the task, the data is partitioned in such a way that each processor
gets equal or comparable amounts of computation [27]. In dynamic load balancing, the subtasks are dynamically
assigned to the processors as and when they finish the previously assigned tasks. In NETRA, when a task is
assigned to a subtree, the DSPs involved perform the local load balancing functions.

Using the information from local load balancing and other measures of computation, global load balancing
can be achieved hierarchically by using the DSP hierarchy. In this scheme, each controller DSP sends its measure of
load to its parent DSP, and the root DSP receives the load information for the entire system. The root DSP then
broadcasts the measure of load of the entire system to the DSPs. When a task is to be allocated, these measures can
be used to select a subtree for its execution as follows: if any subtree corresponding to a child of the current DSP
has an adequate number of processors, then the task is transferred to the child DSP with the lowest load; else, if the
current subtree has enough resources and its load is not significantly greater than the average system load, then the
task is allocated to the current subtree; else the current DSP transfers the task to the parent DSP.
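The subtree selection rule above can be sketched as a short routine. The Python code below is a hypothetical illustration rather than the NETRA implementation: the `DSP` record, its field names, the `tolerance` factor standing in for "not significantly greater", and the decision never to descend again once the task has been passed to a parent (so that the search terminates) are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DSP:
    name: str
    free_pes: int                       # PEs available at the leaves of this subtree
    load: float                         # load measure reported by this subtree
    children: List["DSP"] = field(default_factory=list)
    parent: Optional["DSP"] = None


def allocate(start, needed, avg_load, tolerance=1.25):
    """Return the DSP whose subtree should run the task, or None if none fits."""
    dsp, moving_up = start, False
    while dsp is not None:
        if not moving_up:
            # Rule 1: hand the task to the least loaded child that has enough PEs.
            fits = [c for c in dsp.children if c.free_pes >= needed]
            if fits:
                dsp = min(fits, key=lambda c: c.load)
                continue
        # Rule 2: keep the task here if resources and load permit.
        if dsp.free_pes >= needed and dsp.load <= tolerance * avg_load:
            return dsp
        # Rule 3: otherwise pass the task to the parent DSP.
        dsp, moving_up = dsp.parent, True
    return None


if __name__ == "__main__":
    a = DSP("a", free_pes=4, load=0.5)
    b = DSP("b", free_pes=4, load=0.9)
    root = DSP("root", free_pes=8, load=0.9, children=[a, b])
    a.parent = b.parent = root
    print(allocate(root, 4, avg_load=0.8).name)   # small task goes to child "a"
    print(allocate(root, 6, avg_load=0.8).name)   # larger task stays at "root"
```

In the usage example, a task needing 4 PEs is pushed down to the less loaded child, while a task needing 6 PEs fits no single child and is kept at the root.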

Flexible Communication

Availability of flexible communication is critical to achieving high performance. For example, when a
partition operates in SIMD mode there is a need to broadcast the programs. When a partition operates in MIMD
mode, where processors in the partition cooperate in the execution of a task, one or more programs need to be
transferred to the local memories of the processors. Both requirements justify the need for a selective broadcast
capability. In order to take advantage of spatial parallelism in vision tasks, processors working on neighboring data
need to communicate quickly amongst themselves. The programmability and flexibility of the crossbar provide fast
local communication. Most common vision algorithms such as FFTs, filtering, convolution, counting, transforms,
etc., need a broad range of processor connectivities for efficient execution. These connectivities include arrays,
pipelines, several systolic configurations, shuffle-exchanges, cubes, meshes, pyramids, etc. Each of these
connectivities may perform well for some tasks and badly for others. Using a crossbar with a selective broadcast
capability, any of the above configurations can be achieved, and consequently, optimal performance can be
achieved at the clusters.

Several techniques for implementing reconfigurability between a set of PEs were studied [29,30]. It was
discovered that using a crossbar switch to connect all PEs was simpler than any other scheme. The popular
argument that crossbar switches are expensive was easily countered. When designing communication networks in
VLSI, the primary constraint is the number of pins and not the chip area. The number of pins is governed by the
number of ports on the network and is independent of the type of network. Furthermore, it was realized that a
crossbar with a selective broadcast capability was not only a very powerful and flexible structure, but was also
simpler, scalable and less expensive.

The need for global communication is relatively low and infrequent. Global communication is needed for
intertask communication, i.e., from one task to another in the IVS pipeline. It is also needed to input and output
data, to transfer data within a subsystem when a task is executed on more than one cluster, and finally, to load the
programs. The most important issue in global communication is that the network speed should be matched with the
crossbar speed as well as with the processor speed. Global communication is performed through the global memory
using the interconnection network, or using the DSP hierarchy. Another alternative we consider is connecting all
the clusters and DSPs to a global bus. Since the DSPs perform most control functions and the loading of programs
and data, the responsibility of intertask communication does not lie with the hierarchy. In Section 5 we present an
extensive analysis of the global communication networks in NETRA. Then, using the analysis developed here, we
present the performance of several algorithms in a companion paper.

4. Mapping Parallel Algorithms 

There are two main considerations in mapping the parallel algorithms: first, mapping individual tasks or
algorithms, and second, integration of various tasks. Mapping individual tasks involves efficient division of the
task(s) on the available processors, intratask communication, load balancing, and input and output of data. If the
task is mapped onto more than one processor cluster, then the mapping will require both intra-cluster and
inter-cluster communication. Integration of algorithms involves intertask communication, data transfer between
tasks, formatting the data for the next task, and global load balancing.

The methodology we use for mapping parallel algorithms is multi-dimensional divide-and-conquer with
medium to large grain parallelism. An individual task (in the following discussion task and algorithm are used
interchangeably) can be efficiently mapped using spatial parallelism because most of the vision algorithms are
performed on two dimensional data. However, integration of tasks involves exploiting both spatial and temporal
parallelism. Temporal parallelism can be exploited by recognizing intertask data dependencies. In NETRA, virtual
partitioning of processors, reconfigurability, flexibility of communication and distributed control make it easy to
exploit the temporal parallelism available in integrated vision systems. Furthermore, temporal parallelism can be
improved by making data available to the next task in the pipeline as soon as it is produced by the previous task.
This is achieved using the macro data flow approach between tasks.

4.1. Classification of Common Vision Algorithms 

We can classify some of the common vision algorithms according to their communication requirements when 
mapped onto parallel processors. The classification provides an insight to the performance of an algorithm depend- 
ing on its communication requirements. 

(1) Local Fixed - In these algorithms, the output depends on a small neighborhood of input data in which the
neighborhood size is normally fixed. Sobel edge detection, image scaling, and thresholding are examples of
such algorithms.

(2) Local Varying - Like the local fixed algorithms, the output at each point depends on a small neighborhood of
input. However, the neighborhood size is an input parameter and is independent of the input image size.
Convolutions, edge detection and most other filtering and smoothing operations are examples of such
algorithms.

(3) Global Fixed - In such algorithms each output point depends on the entire input image. However, the
computation is normally input data independent (i.e., computation does not vary with the type of image and
only depends on the size of the image). The Two Dimensional Discrete Fourier Transform and histogram
computation are examples of such algorithms.

(4) Global Varying - Unlike global fixed algorithms, in these algorithms the amount of computation and
communication depends on the image input as well as its size. That is, the output may depend on the entire
image or may depend on a part of the image. In other words, the computation is data dependent. The Hough
Transform and Connected Component Labeling are examples of such algorithms. In an image, a connected
component may span only a small region, or in the worst case the entire image may be one connected
component (a spiral). Similarly, in the case of the Hough transform for detecting lines, a line may span across
the image (meaning its votes must come from distant pixels or edges) or it may be localized.

4.2. Mapping an Algorithm on One Cluster

Mapping a task on one cluster means that intratask communication will only involve communication between
processors of the same cluster. Figure 7 shows how a parallel algorithm is mapped on a cluster. Let us assume that
there are P processors in a cluster. As shown in Figure 7, first the program and data are loaded onto the processor
cluster. In both SIMD and MIMD modes, the program is broadcast onto the cluster processors. The data division
depends on the particular algorithm. If algorithms are mapped in SIMD or systolic mode then the compute and
communication cycles will be intermixed. If the algorithms are mapped in MIMD mode then each processor
computes its partial results and then communicates with others to exchange or merge data.

Let us assume that an algorithm is mapped on one cluster with P processors. The total processing time in such
a mapping consists of the following components: program load time onto the cluster processors ($t_{pl}$), data load
and partitioning time ($t_{dl}$), computation time of the divided subtasks on the processors ($t_{cp}$), which is the sum
of the maximum processing time on a processor $P_i$ and the intra-cluster communication time ($t_{comm}$), and result
report time ($t_{rr}$). $t_{dl}$ consists of three components: 1) data read time from the global memory ($t_r$) by the
cluster DSP, 2) crossbar switch setup time ($t_{setup}$), and 3) the data broadcast and distribution time onto the
cluster processors ($t_{br}$). The total processing time $\tau(P)$ of the parallel algorithm is given by

$$\tau(P) = t_{pl} + t_{dl} + t_{cp} + t_{rr}$$

where

$$t_{dl} = t_r + t_{setup} + t_{br}$$

and, if the computation and communication do not overlap, then

$$t_{cp} = \max_{0 \le i < P}(t_{p_i}) + t_{comm} \qquad (5)$$

else, if computation and communication can completely overlap, then

$$t_{cp} = \max\left(\max_{0 \le i < P}(t_{p_i}),\; t_{comm}\right) \qquad (6)$$

In the above equations, $t_r$ depends on the effective bandwidth of the global interconnection network.

Figure 7 : Mapping Algorithms on One Cluster
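As a worked illustration of the single-cluster timing model, the following Python sketch evaluates the total processing time for both the non-overlapped case (Eq. 5) and the fully overlapped case (Eq. 6). The function name, argument names and the numbers in the usage lines are hypothetical; only the formulas are taken from the text.

```python
def single_cluster_time(t_pl, t_r, t_setup, t_br, t_p, t_comm, t_rr, overlap=False):
    """Total processing time of an algorithm mapped on one cluster.

    t_p is the list of per-processor processing times; all times use one unit.
    """
    # Data load time: global memory read + crossbar setup + broadcast/distribution.
    t_dl = t_r + t_setup + t_br
    if overlap:
        # Eq. (6): computation and communication overlap completely.
        t_cp = max(max(t_p), t_comm)
    else:
        # Eq. (5): communication follows the slowest processor.
        t_cp = max(t_p) + t_comm
    return t_pl + t_dl + t_cp + t_rr


if __name__ == "__main__":
    args = dict(t_pl=2, t_r=3, t_setup=1, t_br=4, t_p=[10, 12, 11], t_comm=5, t_rr=1)
    print(single_cluster_time(**args))                 # 2 + 8 + (12 + 5) + 1 = 28
    print(single_cluster_time(**args, overlap=True))   # 2 + 8 + max(12, 5) + 1 = 23
```

The difference between the two results is exactly the communication time hidden behind the computation when the two phases overlap.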

4.3. Mapping an Algorithm on More than One Cluster

If an algorithm is mapped on more than one cluster then the communication consists of intra-cluster
communication as well as inter-cluster communication. Since the cost of inter-cluster communication is higher than
that of intra-cluster communication, the inter-cluster communication should be minimized while mapping a parallel
algorithm on more than one cluster. Figure 8 shows how a typical algorithm will be mapped onto two clusters.

Figure 5 shows how processor $P_i$ in cluster $C_i$ will communicate with processor $P_j$ in cluster $C_j$ using the
global bus. The communication will be performed explicitly by messages. Processor $P_i$ will send the data and the
identification of the receiving processor to its DSP ($DSP_i$). The DSP will then forward the message to the
corresponding DSP in the other cluster ($DSP_j$). $DSP_j$ will send the data to $P_j$. Alternatively, a processor $P_i$
in cluster $C_i$ will communicate with a processor $P_j$ in cluster $C_j$ using the global memory as shown in
Figure 10. Processor $P_i$ will write the data into a specified memory location in the global memory using its port
connected to the multistage interconnection network. Processor $P_j$ will request the data from the Memory Line
Controller on its port connected to the interconnection network. The basic difference between the two approaches is
the available bandwidth through the global interconnection. In the case of the bus, the effective bandwidth will be
much lower than the effective bandwidth of the multistage interconnection network because only one processor or
DSP can access the bus at a time.

Figure 8 : Mapping Algorithms on Multiple Clusters


The performance of the algorithm mapped on more than one cluster depends on how much inter-cluster
communication is required and by how many processors, which in turn depends on the type of algorithm. Figure 9
illustrates various ways in which an algorithm can be mapped on two or more clusters. However, in the figure we
only show two clusters. Case a) represents the best case, in which a parallel algorithm can be mapped such that only
one or no processors need to communicate with any processor in the other cluster. This is obtained by partitioning
the data in such a way that only one processor in each cluster gets the boundary data and the algorithm is such that
communication consists only of exchanging boundary values. Examples of such algorithms include local fixed and
local varying algorithms.

In an average case, some of the processors in one cluster are required to communicate with processors in the
other cluster. Case b) in Figure 9 illustrates such a case. The figure shows that data processed by a few processors
is needed by corresponding processors in the other cluster. However, the figure does not show how the transfer of
the data will take place. That depends on the chosen global interconnection network and how the algorithm is
mapped. Global varying algorithms may need this type of communication. For example, the connected component
labeling algorithm can be mapped in such a way that only those processors which have boundary components need
to communicate across clusters.





Figure 9 : Types of Intercluster Communication

The third case (c) represents the worst case. The worst case means that data produced by each processor is
required by at least one processor in another cluster. Global fixed and some global varying algorithms need this
type of communication.

Now, let us discuss how inter-cluster communication is incorporated in the performance of an algorithm. Let
us assume that an algorithm is mapped on C clusters. The total computation time for an algorithm (Eq. 5) is now
given by

$$t_{cp} = \max_{1 \le j \le C}\left(\max_{0 \le i < P}(t_{p_i})\right) + t_{comm} \qquad (7)$$

where $t_{comm}$ is now given by

$$t_{comm} = t_{cl} + t_{icl} \qquad (8)$$

where $t_{cl}$ represents the intra-cluster communication time and $t_{icl}$ denotes the inter-cluster communication time.

$t_{icl}$ depends not only on the type of algorithm, how it is mapped, and how many clusters it is mapped on, but
also on the effective bandwidth of the global interconnection. The effective bandwidth depends on the
communication requirements of the other tasks in the system which may be executing on other clusters. Let us
assume that $t_{icl}$ denotes the communication time when there is no interference in the interconnection network,
that is, $t_{icl}$ denotes the write and read time given that the network is available whenever inter-cluster
communication is needed. Then the actual inter-cluster communication will be degraded by a factor w which
depends on the traffic intensity in the network and the interference by communication of other tasks in the system.
Therefore, instead of $t_{icl}$, the inter-cluster communication time will be $w \times t_{icl}$.

5. Inter-cluster Communication in NETRA 

Communication between processors within the same cluster is performed using the crossbar connections.
Communication between processors in different clusters can be performed in various ways. First, the global memory
can be used for this purpose as follows. The processor(s) needing to send data to another processor in a different
cluster write the data into designated locations in the memory. This involves setting up the appropriate circuit
through the global multistage interconnection network to the memory module, followed by a data transfer. The data
is transferred in block mode. The Memory Line Controller (MLC) updates the information about the destination
port(s), the length of the data block, and the block's starting address, and sets a flag indicating the availability of
the data. Now the destination processor can read the data using this information. Note that this method permits
out-of-order requests to be serviced. For example, if the destination processor tries to read the data before it has
been written, the MLC informs the processor of this situation, and when the data is actually written into the global
memory the MLC informs the destination processor. This is a block-level data-flow approach. The main advantages
of this approach are that asynchronous communication is possible, out-of-order messages can be handled, and
efficient pipelining of data can be achieved.
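The block-level data-flow behaviour of the MLC can be pictured with a toy model. The Python sketch below is purely illustrative: the class and method names, the use of a block tag as the lookup key, and the `notified` list that stands in for the message sent to a waiting processor are assumptions made here and are not part of the NETRA design described in the text.

```python
class MemoryLineController:
    """Toy model of block-level data flow through the global memory."""

    def __init__(self):
        self.blocks = {}     # tag -> (destination port, block length, data)
        self.waiting = {}    # tag -> ports that tried to read before the write
        self.notified = []   # (port, tag) notifications issued after a late write

    def write(self, tag, dest_port, data):
        # Record destination, length and data; the presence of the entry
        # plays the role of the "data available" flag.
        self.blocks[tag] = (dest_port, len(data), list(data))
        # Inform every processor that asked for this block too early.
        for port in self.waiting.pop(tag, []):
            self.notified.append((port, tag))

    def read(self, tag, port):
        if tag not in self.blocks:
            # Out-of-order request: remember it and report "not yet written".
            self.waiting.setdefault(tag, []).append(port)
            return None
        return self.blocks[tag][2]


if __name__ == "__main__":
    mlc = MemoryLineController()
    print(mlc.read("edges", port=3))          # None: block not written yet
    mlc.write("edges", dest_port=3, data=[7, 8, 9])
    print(mlc.notified)                       # [(3, 'edges')]
    print(mlc.read("edges", port=3))          # [7, 8, 9]
```

The early read does not fail; it is simply deferred until the producer writes the block, which is what allows sender and receiver to run asynchronously.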

The second alternative to perform inter-cluster communication is to use the DSP tree links. However, for
distant inter-cluster communications, the tree may not perform well because of the root bottlenecks typical of any
tree structure. The main function of the tree structure is to provide a control hierarchy for the clusters. Its links are
mainly used for program and data broadcast to subtrees, and DSPs use the tree links to send (receive) control
information to (from) other DSPs.

The third alternative strategy to perform inter-cluster communication is to use a high speed global bus that
connects all DSPs and one port from each cluster. The global memory is also connected to the bus and is accessible
to all the clusters via the bus. The communication is done explicitly by messages and synchronously. Figures 10 and
5 show the first and third communication methods. The figures show how a processor $P_i$ of cluster $C_i$ will
communicate with processor $P_j$ of cluster $C_j$ using the two strategies.

Inter-cluster communication is needed in the following cases: i) an algorithm is mapped in parallel on more
than one cluster and the processors need to communicate to exchange partial results or combine their results, ii) in
an integrated vision system, output data of a task produced at one or more clusters needs to be transferred to the
next task executing on different clusters, and iii) to perform input and output of data and results.

Figure 10 : Inter-Cluster Communication Using Global Memory

The extent of inter-cluster communication depends on the type of algorithms, how they are mapped in
parallel, the frequency of communication, and the amount of data to be communicated.

5.1. Analysis of Inter-Cluster Communication 

There are several parameters that affect the inter-cluster communication time. The architecture dependent
parameters are: number of processors (i.e., number of clusters and number of processors in each cluster), number of
memory modules, number of processors per port connected to the global interconnection, and the type of
interconnection network. Some parameters depend both on the architecture and on the type of algorithms, how they
are mapped, their communication requirements when mapped onto multiple clusters, etc. Furthermore, the
communication time depends not only on the underlying algorithms but also on the network traffic generated by
other processors in the system, because there may be conflicts in accessing the network as well as the memory
modules.

We consider an equivalent model of the architecture as shown in Figure 11. The model shows N processors
connected to M memory modules through a global interconnection network. N is given by $C \times P_t + N_{dsp}$,
where C is the number of clusters, $P_t$ is the number of ports in each cluster and $N_{dsp}$ is the number of DSPs in
the system. For simplicity, we assume that each cluster contains an equal number of processors. The number of
physical processors is given by $C \times P_t \times P_p$, where $P_p$ is the number of processors per port.



Figure 11 : Equivalent Model for Global Communication

The following analysis is based on the analysis presented by Patel in [31,32]. He developed an analytical
model for evaluating the performance of alternative processor-memory interconnections and showed that the
analysis is reasonably accurate. Consider the execution of a typical parallel algorithm on multiple clusters. The
execution will consist of processing, intra-cluster communication and inter-cluster communication. Figure 12 shows
the computation and communication phases of an algorithm. The computation time is given by $t_{cp}$, the
intra-cluster communication time is given by $t_{cl}$, and the inter-cluster communication time is given by $t_{icl}$, all
in terms of equivalent processor cycles. However, due to conflicts in the network or in memory modules, a
processor may have to wait for some cycles before being able to access the network and write to (or read from) the
memory. In effect, this can be seen as the communication time being elongated by a factor w for each request:
instead of being $t_{icl}$, it is now $w \times t_{icl}$, as shown in Figure 12. Therefore, if the probability of accessing the
global network in each processing cycle is m and for each access the communication time is $t_{icl}$, then the useful
computation of t processor cycles takes $t + m \times t \times w \times t_{icl}$ cycles, where $t = t_{cp} + t_{cl}$. The fraction of
useful work (utilization U) is given by

$$U = \frac{t}{t + m \times t \times w \times t_{icl}} \qquad (9)$$

The average number of busy memory modules (or the fraction of time the bus is busy when the global
interconnection is a bus) is

$$B = \frac{N \times m \times t \times t_{icl}}{t + m \times t_{icl} \times t \times w} \qquad (10)$$

and in terms of utilization,

$$B = N \times m \times t_{icl} \times U$$

Figure 12 : Computation and Communication Activities of a Processor (without and with interference)

In [31] it is shown that the utilization primarily depends on the product $m \times t_{icl}$ rather than on m and
$t_{icl}$ individually. In other words, the processor utilization primarily depends on the traffic intensity and to a
lesser extent on the nature of the traffic.

For a particular algorithm, all the parameters are known except w. The probability of accessing the global
network is essentially given by the number of times communication is needed per processor cycle and is known
when an algorithm is mapped in parallel. The factor w depends on the algorithm parameters as well as on the
interference from other processors accessing the global network and the memory, the number of processors, the
number of memory modules, the type of interconnection network and the access rate m.

Consider the processor activities again. A processor needing to access the global memory or the bus submits
requests again and again until accepted; on average this happens for $(w - 1) \times t_{icl}$ time units. After the request
is granted, the processor has a path to memory for $t_{icl}$ time units. In other words, the network sees an average of
$w \times t_{icl}$ consecutive requests for unit service time. Therefore, the request rate (for unit service) from a
processor as seen by the network is

$$m' = \frac{m \times w \times t_{icl}}{1 + m \times w \times t_{icl}}$$

and in terms of utilization

$$m' = 1 - U \qquad (12)$$

For details, the reader is referred to [31].

The model that we analyze is a system of N sources and M destinations. Each source generates a request with
probability $m'$ in each unit time. The request is independent, random, and uniformly distributed over all
destinations. Each request is for one unit service time. The following is an analysis for a bus and for a multistage
delta network.


Bus : We know from the earlier discussion that

$$B = N \times m \times t_{icl} \times U$$

and also, assuming all sources have the same request rate, the average fraction of time the bus is busy is given by

$$B = 1 - (1 - m')^N \qquad (14)$$

Equating these two expressions for B results in the non-linear equation

$$N \times m \times t_{icl} \times U - \left[1 - (1 - m')^N\right] = 0$$

In the above equation, the value of $m'$ can be substituted in terms of w, and hence the value of w can be computed.
If the request rate from the sources is not uniform, i.e., if the request rate from source $N_i$ is $m_i$, then the above
equation becomes

$$B - \left[1 - \prod_{i=1}^{N} (1 - m_i')\right] = 0 \qquad (16)$$

where B is again the average fraction of time the bus is busy.

When evaluating the performance of a parallel algorithm mapped across clusters there will be two request
rates: one for the processors taking part in running the algorithm, and the other for the rest of the processors in the
system, which will be an input parameter.
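For the uniform-rate bus model, the non-linear equation can be solved with a few lines of code. Since $1 - m' = U$, the equation reduces to $N m t_{icl} U - (1 - U^N) = 0$, whose left side increases monotonically from $-1$ at $U = 0$ to $N m t_{icl}$ at $U = 1$, so bisection on U is sufficient; w then follows from $U = 1/(1 + m w t_{icl})$. The Python sketch below is one possible numerical treatment under these assumptions; the function name and the parameter values in the usage lines are hypothetical, and m and t_icl are assumed to be positive.

```python
def bus_interference_factor(N, m, t_icl, tol=1e-12):
    """Solve N*m*t_icl*U - (1 - U**N) = 0 for U, then recover w.

    Returns (w, U). Assumes m > 0 and t_icl > 0.
    """
    def f(U):
        # Demand on the bus minus the fraction of time the bus can be busy.
        return N * m * t_icl * U - (1.0 - U ** N)

    lo, hi = 0.0, 1.0                   # f(lo) < 0 < f(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    U = 0.5 * (lo + hi)
    # From U = 1 / (1 + m * w * t_icl).
    w = (1.0 - U) / (U * m * t_icl)
    return w, U


if __name__ == "__main__":
    print(bus_interference_factor(N=1, m=0.1, t_icl=5.0))   # single source: w is 1
    print(bus_interference_factor(N=8, m=0.1, t_icl=5.0))   # contention: w exceeds 1
```

With a single source there is no contention and the solver returns w = 1 (up to the tolerance); adding sources lowers U and raises w.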

Multistage Interconnection (Delta) : A delta network is an n-stage network constructed from $a \times b$ crossbar
switches with a resulting size of $a^n \times b^n$. Therefore, $N = a^n$ and $M = b^n$. For a complete description refer to
[32]. Functionally, a delta network is an interconnection network which allows any of the N sources (processors) to
communicate with any one of the M destinations (memory modules). However, two requests may collide in the
network even if the requests are made to different memory modules. We use results from [31,32] to obtain the
average number of busy main memory modules B, which is given by

$$B = M \times m_n$$

and the following equation is satisfied:

$$N \times m \times t_{icl} \times U - M \times m_n = 0$$

where

$$m_{i+1} = 1 - \left(1 - \frac{m_i}{b}\right)^a, \quad 0 \le i < n$$

$$m_0 = 1 - U.$$

For details, the reader is referred to [31,32].

These equations are solved numerically to obtain the interference delay factor w, which is used in the
performance evaluation of algorithms mapped across multiple clusters.
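A numerical solution for the delta network can be sketched in the same way as for the bus. The text does not specify the solver, so the Python code below simply assumes bisection on U, which works because $M \times m_n$ decreases as U increases while $N m t_{icl} U$ increases; the function name and the sample parameters are hypothetical, and m and t_icl are assumed to be positive.

```python
def delta_interference_factor(a, b, n, m, t_icl, tol=1e-12):
    """Solve N*m*t_icl*U - M*m_n = 0 for an n-stage delta network of a x b switches.

    Returns (w, U), with N = a**n sources and M = b**n memory modules.
    """
    N, M = a ** n, b ** n

    def busy_modules(U):
        rate = 1.0 - U                              # m_0 = 1 - U
        for _ in range(n):                          # m_{i+1} = 1 - (1 - m_i / b)^a
            rate = 1.0 - (1.0 - rate / b) ** a
        return M * rate                             # B = M * m_n

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if N * m * t_icl * mid - busy_modules(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    U = 0.5 * (lo + hi)
    # From U = 1 / (1 + m * w * t_icl).
    return (1.0 - U) / (U * m * t_icl), U


if __name__ == "__main__":
    print(delta_interference_factor(a=1, b=1, n=3, m=0.1, t_icl=5.0))  # one path: w is 1
    print(delta_interference_factor(a=2, b=2, n=3, m=0.1, t_icl=5.0))  # 8 x 8 network
```

The degenerate 1 × 1 case has a single source and a single module, so no conflicts arise and w = 1; the 8 × 8 network built from 2 × 2 switches gives w greater than 1 because requests can collide inside the network.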
5.2. Approach to Performance Evaluation of Algorithms

Performance of an algorithm mapped on multiple clusters is governed by various factors. Table 1 summarizes
the parameters affecting the performance of a parallel algorithm. The approach to evaluating the performance of an
algorithm is as follows. Using the parameters and a particular mapping, the computation time ($t_{cp}$), intra-cluster
communication time ($t_{cl}$) and inter-cluster communication time ($t_{icl}$) are determined. The traffic intensity for a
processor (or a cluster, depending on how an algorithm is mapped) is given by $\frac{t_{icl}}{t_{cp} + t_{cl}}$. Using the
traffic intensity values, and using a range of traffic intensity values for interference, the effective bandwidth of the
network is determined, that is, the factor w is computed. In a companion paper, we will present performance
evaluation of several algorithms using the above method.

Consider a parallel execution of an algorithm across clusters. If the execution time when the algorithm is
executed on a single processor is $t_{seq}$, then the speed up in the best case is given by

$$S_p = \frac{t_{seq}}{t_{cp} + t_{cl} + t_{icl}} \qquad (19)$$

that is, assuming there is no interference while accessing the network or the global memory. Under conditions in
which there are conflicts while accessing the network, the inter-cluster communication time will be given by
$w \times t_{icl}$, and therefore the speed up will be given by

$$S_p' = \frac{t_{seq}}{t_{cp} + t_{cl} + w \times t_{icl}} \qquad (20)$$

Table 1 : Parameters for Performance Evaluation

  Parameter    Description
  P            No. of processors executing an algorithm
  C            Cluster size
  N            Total no. of processors in the system
  D            Data size
  P_p          No. of processors per port
  M            No. of memory modules
  GICN         Type of global interconnection
  m × t        Traffic intensity for interference in network and memory accesses by (N-P) processors
  m_i × t_i    Traffic intensity for network and memory access by the partition executing the algorithm

Hence, the degradation in speed up with respect to the best case speed up will be

$$\frac{S_p - S_p'}{S_p} = \frac{(w - 1) \times t_{icl}}{t_{cp} + t_{cl} + w \times t_{icl}} \qquad (21)$$
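The speed up expressions can be evaluated directly once w is known. The following Python sketch computes the best-case speed up (Eq. 19), the speed up under interference (Eq. 20) and the relative degradation (Eq. 21); the function name and the numbers in the usage lines are hypothetical and serve only to show that the three quantities are mutually consistent.

```python
def speedups(t_seq, t_cp, t_cl, t_icl, w):
    """Return (best-case speed up, speed up with interference, relative degradation)."""
    sp_best = t_seq / (t_cp + t_cl + t_icl)                        # Eq. (19)
    sp_actual = t_seq / (t_cp + t_cl + w * t_icl)                  # Eq. (20)
    degradation = (w - 1.0) * t_icl / (t_cp + t_cl + w * t_icl)    # Eq. (21)
    return sp_best, sp_actual, degradation


if __name__ == "__main__":
    best, actual, loss = speedups(t_seq=1000.0, t_cp=100.0, t_cl=10.0, t_icl=15.0, w=2.0)
    print(best, actual, loss)
    # The closed form of Eq. (21) agrees with the direct ratio of the two speed ups.
    print(abs((best - actual) / best - loss) < 1e-12)
```

With these sample values the best-case speed up is 8.0, and doubling the effective inter-cluster communication time (w = 2) costs roughly 11 percent of it.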

6. Summary 

In this paper we presented a model of computation for IVSs. Using the model, desired features and
capabilities of a parallel architecture for IVSs were derived. Then a multiprocessor architecture suitable for IVSs
(called NETRA) was presented. The topology of NETRA is recursively defined and hence is easily scalable from
small to large systems. Homogeneity of NETRA permits fault tolerance and graceful degradation under faults.
NETRA is a recursively defined tree-type hierarchical architecture whose leaf nodes consist of clusters of processors
connected with a programmable crossbar with selective broadcast capability to provide the desired flexibility. We
presented a qualitative evaluation of NETRA. Then general schemes were described to map parallel algorithms onto
NETRA. Finally, an analysis to evaluate alternative inter-cluster communication strategies in NETRA was
presented, together with a methodology to evaluate the performance of parallel algorithms mapped across multiple
clusters.

In a companion paper (part II of this paper) we present performance evaluation of several common vision
algorithms on NETRA. The paper discusses the performance of algorithms on one cluster, their analysis and
implementation. Furthermore, the paper includes performance evaluation of alternative communication strategies
and presents the mapping of algorithms across multiple clusters. The effect of interference in the global
interconnection network and global memory on the performance of algorithms is also studied.